The theory of statistical hypothesis testing was developed in the early 20th century. Among other uses, it was designed to apply the scientific method to data sampled from populations. In the following sections, we explain the steps of hypothesis testing, the potential results, and the possible errors that can be made when interpreting a statistical test. We also define power, sample size, and effect size, and describe the relationships among them in testing.

Getting the language down

Here are some of the most common terms used in hypothesis testing:

Null hypothesis (abbreviated H₀): The assertion that any apparent effect you see in your data is not evidence of a true effect in the population, but is merely the result of random fluctuations.

Alternate hypothesis (abbreviated H₁ or Hₐ): The assertion that there is indeed evidence in your data of a true effect in the population, over and above what would be attributable to random fluctuations.

Significance test: A calculation designed to determine whether H₀ can reasonably explain what you see in your data or not.

Significance: The conclusion that random fluctuations alone can't account for the size of the effect you observe in your data. In this case, H₀ must be false, so you accept H₁.

Statistic: A number that you obtain or calculate from your sample.

Test statistic: A number calculated from your sample as part of performing a statistical test, typically for the purpose of testing H₀. In general, the test statistic is calculated as the ratio of a number that measures the size of the effect (the signal) to a number that measures the size of the random fluctuations (the noise); the first sketch after this list shows this calculation in code.

p value: The probability that random fluctuations alone (in the absence of any true effect in the population) could produce an effect at least as large as the one observed in your sample. Equivalently, the p value is the probability of random fluctuations making the test statistic at least as large as the value you calculate from your sample (or, more precisely, at least as far away from what H₀ predicts, in the direction of H₁).

Type I error: Choosing that H₁ is correct when, in fact, no true effect above random fluctuations is present.

Alpha (α): The probability of making a Type I error.

Type II error: Choosing that H₀ is correct when, in fact, there is indeed a true effect present that rises above random fluctuations.

Beta (β): The probability of making a Type II error.

Power: The same as 1 − β, which is the probability of choosing H₁ as correct when in fact there is a true effect above random fluctuations present (see the simulation sketch after this list).
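To make the signal-to-noise idea and the p value concrete, here is a minimal sketch in Python of a one-sample Student t test. The sample values and the hypothesized mean are made up for illustration; the test statistic is the observed effect (signal) divided by its standard error (noise), and the p value is the two-sided tail probability from the t distribution.

```python
import math

from scipy import stats  # provides the t distribution's tail probability

# Hypothetical data: e.g., change in some measurement for 8 subjects.
sample = [4.2, 1.1, 3.5, 5.0, 2.3, 0.8, 3.9, 2.6]
mu0 = 0.0  # H0: the true mean change is zero (no real effect)

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

signal = mean - mu0         # size of the observed effect
noise = sd / math.sqrt(n)   # standard error: size of the random fluctuations
t_stat = signal / noise     # test statistic = signal / noise

# p value: chance that random fluctuations alone would push the test
# statistic at least this far from what H0 predicts (two-sided).
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

In practice you would get the same numbers from scipy.stats.ttest_1samp(sample, mu0); the hand calculation is shown only to expose the signal-to-noise structure.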
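Power is easiest to grasp by simulation. The sketch below uses made-up settings (a true effect of 0.5 standard deviations, n = 20, α = 0.05): it repeatedly draws samples from a population where H₁ is actually true, runs a t test on each, and counts how often the test correctly rejects H₀. That proportion estimates the power (1 − β), and its complement estimates β.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05        # the Type I error rate we are willing to tolerate
true_effect = 0.5   # assumed true mean, in SD units (so H1 is true)
n, trials = 20, 10_000

rejections = 0
for _ in range(trials):
    # Draw a sample from a population where the effect is real.
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, 0.0)  # test H0: mean = 0
    if p < alpha:
        rejections += 1  # the test correctly chose H1

power = rejections / trials  # estimated power = 1 - beta
print(f"Estimated power: {power:.2f} (so beta is roughly {1 - power:.2f})")
```

Raising the sample size or the true effect size increases the estimated power, which previews the relationships between power, sample size, and effect size discussed later.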

Testing for significance

All the common statistical significance tests, including the Student t test, chi-square, and